Author : Indumathi Pandiyan

Featurization & Model Tuning (FMT) Project - 5th project submitted for PGP-AIML, Great Learning, on 06-Feb-2022

• DOMAIN: Semiconductor manufacturing process

• CONTEXT: A complex modern semiconductor manufacturing process is normally under constant surveillance via the monitoring of signals/variables collected from sensors and/or process measurement points. However, not all of these signals are equally valuable in a specific monitoring system. The measured signals contain a combination of useful information, irrelevant information, and noise. Engineers typically have a much larger number of signals than are actually required. If we consider each type of signal as a feature, then feature selection may be applied to identify the most relevant signals. The process engineers may then use these signals to determine key factors contributing to yield excursions downstream in the process. This will enable increased process throughput, decreased time to learning, and reduced per-unit production costs. These signals can be used as features to predict the yield type, and by analysing different combinations of features, the essential signals that impact the yield type can be identified.

• DATA Description:
sensor-data.csv: (1567, 592). The data consists of 1567 data points, each with 591 features. The dataset represents a selection of such features, where each example is a single production entity with its associated measured features, and the label is a simple pass/fail yield for in-house line testing. In the target column, “–1” corresponds to a pass and “1” corresponds to a fail, and the timestamp is for that specific test point.

PROJECT OBJECTIVE:
We will build a classifier to predict the Pass/Fail yield of a particular process entity and analyse whether all the features are required to build the model or not.

Steps and tasks: [ Total Score: 60 points]

1.Import and understand the data. [5 Marks]

A.Import ‘signal-data.csv’ as DataFrame. [2 Marks]
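A minimal sketch of this import step (hedged: the notebook reads the real `signal-data.csv` from disk; here a tiny in-memory stand-in built with `StringIO`, with hypothetical values, keeps the example self-contained):

```python
import io

import pandas as pd

# In the notebook the file would be read directly from disk:
#   signaldata = pd.read_csv('signal-data.csv')
# A tiny in-memory stand-in (hypothetical values) keeps this sketch runnable.
csv_text = (
    "Time,0,1,Pass/Fail\n"
    "2008-07-19 11:55:00,3030.93,2564.00,-1\n"
    "2008-07-19 12:32:00,3095.78,2465.14,1\n"
)
signaldata = pd.read_csv(io.StringIO(csv_text))

print(signaldata.shape)                   # (rows, columns)
print(signaldata.dtypes.value_counts())   # mix of float, int and object types
```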

Observation:

There are 1567 samples and 592 features/attributes in the signal data

Observation:

Of the 592 features, 590 are of float type, one is of integer type (the target), and one is of object type (the timestamp)

B.Print 5 point summary and share at least 2 observations. [3 Marks]
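The five-point summary itself comes from `DataFrame.describe()`; a sketch on a toy stand-in frame (values chosen only to illustrate a missing count and an extreme maximum):

```python
import pandas as pd

# Toy stand-in frame: one column with a missing value, one with an extreme max.
df = pd.DataFrame({
    "f0": [1.0, 2.0, 3.0, None],        # the null value lowers this count
    "f4": [0.01, 0.02, 0.03, 1114.53],  # one extreme value hints at outliers
})

summary = df.describe().T  # count, mean, std, min, 25%, 50%, 75%, max per feature
print(summary[["count", "min", "50%", "max"]])
```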

Observations

  1. From the five-point summary, it is observed that several features have counts ranging from 1553 to 1566, which indicates the presence of missing (null) values.
  2. Pass/Fail shows a minimum of -1.0 and a maximum of 1.0; per the data description, -1 represents a pass and 1 a fail.
  3. For the features 0 to 3 and 587, 588, the mean is almost equal to the median, which suggests these features are approximately normally distributed.
  4. Feature 4 has mostly very small values while its maximum is 1114.53, which clearly indicates the presence of outliers.

2. Data cleansing: [15 Marks]

A. Write a for loop which will remove all the features with 20%+ Null values and impute rest with mean of the feature. [5 Marks]

Method to remove the columns that have more than 20% null values

Comments: 32 features with more than 20% null values are removed

Imputing with Mean for other null values
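The cleaning loop described above can be sketched as follows (toy DataFrame as a stand-in; the 20% cut-off comes from the task statement):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan, 5.0],  # 60% null -> dropped
    "b": [1.0, 2.0, np.nan, 4.0, 3.0],        # 20% null -> kept and imputed
})

# Drop features whose null share exceeds 20%; impute the rest with the mean.
for col in list(df.columns):
    null_share = df[col].isnull().mean()
    if null_share > 0.20:
        df.drop(columns=col, inplace=True)
    else:
        df[col] = df[col].fillna(df[col].mean())

print(df.columns.tolist())  # ['b']
print(df["b"].tolist())     # [1.0, 2.0, 2.5, 4.0, 3.0]
```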

Comments:

B.Identify and drop the features which are having same value for all the rows. [3 Marks]

Code to identify same value columns
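One way to find such constant columns, sketched on a toy frame (`nunique()` counts distinct non-null values per column):

```python
import pandas as pd

df = pd.DataFrame({
    "varies":   [1, 2, 3],
    "constant": [7, 7, 7],  # identical in every row -> carries no information
})

constant_cols = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=constant_cols)
print(constant_cols)        # ['constant']
print(df.columns.tolist())  # ['varies']
```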

C.Drop other features if required using relevant functional knowledge. Clearly justify the same. [2 Marks]

Comments:

D.Check for multi-collinearity in the data and take necessary action. [3 Marks]

Comments

Analysis shows that many features exhibit multicollinearity; hence, it was decided to first check correlation values and remove highly correlated columns, and then apply VIF to remove the remaining highly multicollinear columns.

  1. Find correlations and remove columns with correlation higher than 0.7
  2. Find multicollinear columns with VIF above a threshold of 10

Comments: 233 columns have a correlation higher than 0.7
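A sketch of the correlation screen (hedged illustration on synthetic data; the 0.7 threshold matches the analysis above — for each highly correlated pair, one column is dropped):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "x_copy": x + rng.normal(scale=0.01, size=200),  # nearly identical to x
    "z": rng.normal(size=200),                       # independent noise
})

# Upper triangle of the absolute correlation matrix, so each pair is
# inspected once; drop a column if any of its pairwise correlations > 0.7.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.7).any()]
print(to_drop)  # ['x_copy']
```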

Method to compute the VIF and return the features whose VIF exceeds the specified threshold

Removing the multicollinear columns from the dataset
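A sketch of the VIF step (the notebook may use `statsmodels`' `variance_inflation_factor`; here the VIF is computed directly from its definition, VIF_j = 1/(1 - R²_j), to stay self-contained; the threshold of 10 follows the text above and the demo data is synthetic):

```python
import numpy as np
import pandas as pd

def high_vif_features(df, threshold=10.0):
    """Return the columns whose variance inflation factor exceeds `threshold`.

    VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing feature j on
    all the remaining features (ordinary least squares with an intercept).
    """
    flagged = []
    X = df.to_numpy(dtype=float)
    for j, col in enumerate(df.columns):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        others = np.column_stack([np.ones(len(others)), others])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid.var() / y.var()
        vif = np.inf if np.isclose(r2, 1.0) else 1.0 / (1.0 - r2)
        if vif > threshold:
            flagged.append(col)
    return flagged

rng = np.random.default_rng(1)
a = rng.normal(size=100)
demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=100),  # near-duplicate of a
    "c": rng.normal(size=100),                      # independent
})
print(high_vif_features(demo))  # ['a', 'b']
```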

Comments

E.Make all relevant modifications on the data using both functional/logical reasoning/assumptions. [2 Marks]

Observation:

Hence dropping this column

Comments :

Comments: Replacing -1 with 0 for easier interpretation of the failure scenarios

Handling the Outlier

Method to handle Outlier by Capping method
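The capping method can be sketched like this (hedged: IQR-based whiskers, the usual box-plot convention for outliers, are assumed here):

```python
import pandas as pd

def cap_outliers(series):
    """Cap values outside the 1.5*IQR whiskers at the whisker values."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return series.clip(lower=lower, upper=upper)

s = pd.Series([1.0, 2.0, 2.5, 3.0, 100.0])  # 100.0 is an obvious outlier
capped = cap_outliers(s)
print(capped.tolist())  # [1.0, 2.0, 2.5, 3.0, 4.5]
```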

Box plot before outlier handling

Comments

After the capping method, the outliers are removed.

Comments:

By the capping method, all the outliers are removed from the chosen columns of the signal dataset

3. Data analysis & visualisation: [5 Marks]

A. Perform a detailed univariate Analysis with appropriate detailed comments after each analysis. [2 Marks]

There are various ways to perform univariate analysis; the following are common approaches

Univariate Analysis

comments

  1. After preprocessing there are 1567 records and 82 features
  2. All null values have been handled (columns dropped or values imputed)
  3. There are 81 float features; the target variable (Pass/Fail) is an integer

Comments
Mostly normal distributions are observed; a few measurements show bimodal or right-skewed distributions.

Pie chart of the Pass/Fail distribution

Comments:
The target variable is highly imbalanced: only around 7% of the samples belong to the minority (fail) class.

Box plot after outlier removal for understanding the data distribution

Segregating the data for better visualization

B. Perform bivariate and multivariate analysis with appropriate detailed comments after each analysis. [3 Marks]

Bivariate Analysis

Observation:

* For analysing the data based on the target value of Pass/Fail, bivariate analysis is done via box plots <br>
* Blue represents fail points and orange represents pass points<br>
* The general observation is that the pass data points have a higher IQR than the fail points for most features<br>
* The median is almost the same for both classes for most features<br>
* The data points are not in widely different ranges<br>

Observation
The median is the same for the Pass and Fail values of feature 4, and the fail distribution is left skewed

Observation
The median is the same for the Pass and Fail values of feature 9, and both box plots look normally distributed. The IQR is higher for the failed cases.

Observation :
For further analysis, a few of the features are visualized in detail.
The median is the same for the Pass and Fail values of feature 4, and the pass distribution is right skewed

Comments:
For better visualisation, the screened dataset is grouped into 4 subsets and viewed with the Pass/Fail variable as hue. It is observed that most data points form a cloud and are mostly uncorrelated. Pattern-wise, the distributions are mostly normal, with a few bimodal ones.

Multivariate analysis

Comments:
The multivariate visualization clearly shows that very few data points remain correlated after the feature engineering and noise removal.

4. Data pre-processing: [10 Marks]
A. Segregate predictors vs target attributes. [2 Marks]

B. Check for target balancing and fix it if found imbalanced. [3 Marks]

Comments:
The target value is imbalanced: pass data (about 93%) far outweighs fail data (nearly 7%).

Comments:
Using SMOTE, the training data is oversampled so that the minority and majority classes contribute equally.

C. Perform train-test split and standardise the data or vice versa if required. [3 Marks]
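A sketch of the split-then-standardise order (synthetic stand-in data; fitting the scaler on the training split only is what prevents test-set statistics from leaking into training):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(5, 2, size=(100, 4))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
print(X_train_s.mean(axis=0).round(6))  # ~0 for every feature
```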

Comments:

D. Check if the train and test data have similar statistical characteristics when compared with original data. [2 Marks]

Comments

5. Model training, testing and tuning: [20 Marks]

A.Use any Supervised Learning technique to train a model. [2 Marks]

For the base model, a Logistic Regression model is used
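A minimal sketch of a Logistic Regression base model (here on `make_classification` stand-in data, since the notebook's preprocessed frame is not reproduced):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data; the notebook fits on the preprocessed signal features.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(round(base.score(X_te, y_te), 3))  # mean accuracy on the held-out split
```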

Comments
With Logistic Regression as the base model, we obtain 80% accuracy

B.Use cross validation techniques. [3 Marks]
Hint: Use all CV techniques that you have learnt in the course

Comments

Cross-validation on the training set gives an accuracy of 79.90% with a standard deviation of 8.21 percentage points.
Hence the model is expected to perform in a range around 79.90%;
at 95% confidence, the reported interval is between 69.5% and 91.36% accuracy.

LOOCV technique
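The cross-validation techniques above (k-fold, stratified k-fold, and leave-one-out) can be sketched as follows (stand-in data; LOOCV fits one model per sample, so it is shown on a small set):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (KFold, LeaveOneOut, StratifiedKFold,
                                     cross_val_score)

X, y = make_classification(n_samples=100, n_features=8, random_state=0)
model = LogisticRegression(max_iter=1000)

kfold = cross_val_score(model, X, y, cv=KFold(10, shuffle=True, random_state=1))
strat = cross_val_score(model, X, y,
                        cv=StratifiedKFold(10, shuffle=True, random_state=1))
loo = cross_val_score(model, X, y, cv=LeaveOneOut())  # one fold per sample

print(f"k-fold: {kfold.mean():.3f} +/- {kfold.std():.3f}")
print(f"stratified: {strat.mean():.3f}, LOOCV: {loo.mean():.3f}")
```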

Comments

C.Apply hyper-parameter tuning techniques to get the best accuracy. [3 Marks]

Randomized Search CV
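A sketch of Randomized Search CV (the hyper-parameter grid here is hypothetical, not the notebook's actual search space):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical search space: sample 4 of the 6 candidate C values at random.
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": [0.001, 0.01, 0.1, 1, 10, 100]},
    n_iter=4, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```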

Comments:

With Randomized Search CV, the accuracy increased to 80%.

D. Use any other technique/method which can enhance the model performance. [4 Marks] Hint: Dimensionality reduction, attribute removal, standardisation/normalisation, target balancing etc.
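One of the hinted techniques, dimensionality reduction with PCA, sketched on stand-in data (features are standardised first so no single scale dominates the components):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = make_classification(n_samples=150, n_features=20, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Keep just enough components to explain 95% of the total variance.
pca = PCA(n_components=0.95).fit(X_scaled)
X_reduced = pca.transform(X_scaled)
print(X_reduced.shape[1], "components retained")
```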

E.Display and explain the classification report in detail. [3 Marks]

Comments:

F.Apply the above steps for all possible models that you have learnt so far. [5 Marks]

The method takes the models, predictors, and targets as input, and outputs the classification metrics

All the classification models are run on the standardized test data to obtain the classification reports
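A sketch of such a loop over models (the model list here is illustrative; the notebook's exact set of models is not reproduced):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
    "RandomForest": RandomForestClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = model.score(X_te, y_te)  # test accuracy per model
    print(name)
    print(classification_report(y_te, model.predict(X_te)))
```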

6.Post Training and Conclusion: [5 Marks]

A. Display and compare all the models designed with their train and test accuracies. [1 Marks]

Observation

B. Select the final best trained model along with your detailed comments for selecting this model. [1 Marks]

Comments: The ROC curve shows the relationship between the false positive rate and the true positive rate across different thresholds. A larger area under the curve indicates a more predictive model.

C. Pickle the selected model for future use. [2 Marks]

Load Saved Model
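Saving and reloading the chosen model with `pickle` can be sketched as follows (the SVC hyper-parameters mirror those quoted in the conclusion; the file name and stand-in data are hypothetical):

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = SVC(C=10, gamma=0.01, kernel="rbf").fit(X, y)

# Serialise the fitted model, then load it back and check it behaves the same.
path = os.path.join(tempfile.gettempdir(), "best_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)
with open(path, "rb") as f:
    loaded = pickle.load(f)

print((loaded.predict(X) == model.predict(X)).all())  # True
```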

D. Write your conclusion on the results. [1 Marks]

Conclusion:

Through proper feature selection and data preparation, the following advantages are obtained.

In this project, as per the instructions given, the unwanted features are removed: features with a high proportion of null values and features with the same value in every row are dropped, and the highly correlated features and the multicollinear columns with VIF above the threshold of 10 are removed. Multicollinear features result in less reliable statistical inference.
With all these steps, around 82 features are retained, which is reasonable for model building.

Predictors and the target are separated, and the data is observed to be highly imbalanced; through SMOTE, the data is balanced. To avoid data leakage, the test and train data are separated first and then standardized for model building. All these steps helped in reducing computational time on this multidimensional feature set.

A base model is built and then tuned with hyper-parameter techniques, and the expected performance in production is estimated through k-fold cross-validation. These steps are repeated for different models and the best model is selected. For this semiconductor data, SVM performs best among all the models in terms of accuracy and recall, and it performs as well on the test data as on the training data.

Hence, it is concluded that through "Featurization, Model Selection and Tuning" on this multidimensional data, the best model identified is a Support Vector Classifier with (C=10, gamma=0.01, kernel='rbf', the "Radial Basis Function" kernel); it gives up to 99.9% accuracy on test data and, based on cross-validation, has 95% confidence of performing between 98.5% and 100% in production.